Retrieving Japanese specialized terms and corpora from the World Wide Web
Authors
Abstract
The BootCaT toolkit (Baroni and Bernardini, 2004) is a suite of Perl programs implementing a procedure to bootstrap specialized corpora and terms from the web using minimal knowledge sources. In this paper, we report ongoing work in which we apply the BootCaT procedure to a Japanese corpus and term extraction task in the hotel terminology domain. The results of our experiments are very encouraging, indicating that the BootCaT procedure can be successfully applied, with relatively small modifications, to a language very different from English and the other Indo-European languages on which we originally tested the procedure.
Similar resources
Building general- and special-purpose corpora by Web crawling
The Web is a potentially unlimited source of linguistic data; however, commercial search engines are not the best way for linguists to gather data from it. In this paper, we present a procedure to build language corpora by crawling and postprocessing Web data. We describe the construction of a very large Italian general-purpose Web corpus (almost 2 billion words) and a specialized Japanese “blo...
International Workshop Natural Language Processing Methods and Corpora in Translation, Lexicography, and Language Learning
TerminoWeb is a web-based platform designed to find and explore specialized domain knowledge on the Web. An important aspect of this exploration is the discovery of domain-specific collocations on the Web and their presentation in a concordancer to provide contextual information. Such information is valuable to a translator or a language learner presented with a source text containing a specifi...
Putting the „Wisdom of Crowds“ to Use in NLP: Collaboratively Constructed Semantic Resources on the Web
Since the early 1990s, the Web has served as a unique corpus with background knowledge for various NLP tasks. The Web as a corpus has been employed in three principal ways: (i) obtaining Web-based frequencies for specific terms and constructions, (ii) collecting term-specific Web corpora by retrieving the corresponding text snippets, and finally (iii) constructing task- and domain-targeted corpora ...
Compilation of Specialized Comparable Corpora in French and Japanese
We present in this paper the development of a specialized comparable corpora compilation tool, for which quality would be close to a manually compiled corpus. The comparability is based on three levels: domain, topic and type of discourse. Domain and topic can be filtered with the keywords used through web search. But the detection of the type of discourse needs a wide linguistic analysis. The ...
An Intelligent Multilingual Information Browsing and Retrieval System Using Information Extraction
In this paper, we describe our multilingual (or cross-linguistic) information browsing and retrieval system, which is aimed at monolingual users who are interested in information from multiple language sources. The system takes advantage of information extraction (IE) technology in novel ways to improve the accuracy of cross-linguistic retrieval and to provide innovative methods for browsing a...